{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Logistic regression\n", "\n", "Logistic regression is an algorithm for building a binary classifier.\n", "\n", "It works well when the **probability** of reaching a goal (Y) depends nonlinearly on the resources available (X): once a resource rises or falls past a certain threshold, the probability of reaching the goal rises or falls sharply.\n", "\n", "Binary classifiers based on logistic regression assign each observation to a class by comparing its predicted probability with a threshold, typically 0.5 (50%).\n", "\n", "Two techniques are commonly used to estimate the regression coefficients: __[MLE](https://es.wikipedia.org/wiki/M%C3%A1xima_verosimilitud)__ (maximum likelihood estimation) and __[least squares](https://es.wikipedia.org/wiki/M%C3%ADnimos_cuadrados)__ (after converting the relationship described by the \"S\" curve into a linear one).\n", "\n", "\n", "__[Scikit Learn - Logistic regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)__" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import os" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "df_entrenamiento = pd.read_csv(os.path.join(\"csv\", \"train.csv\"), index_col='PassengerId')" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr><th></th><th>Survived</th><th>Pclass</th><th>Name</th><th>Sex</th><th>Age</th><th>SibSp</th><th>Parch</th><th>Ticket</th><th>Fare</th><th>Cabin</th><th>Embarked</th></tr>\n",
" <tr><th>PassengerId</th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>1</th><td>0</td><td>3</td><td>Braund, Mr. Owen Harris</td><td>male</td><td>22.0</td><td>1</td><td>0</td><td>A/5 21171</td><td>7.2500</td><td>NaN</td><td>S</td></tr>\n",
" <tr><th>2</th><td>1</td><td>1</td><td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td><td>female</td><td>38.0</td><td>1</td><td>0</td><td>PC 17599</td><td>71.2833</td><td>C85</td><td>C</td></tr>\n",
" <tr><th>3</th><td>1</td><td>3</td><td>Heikkinen, Miss. Laina</td><td>female</td><td>26.0</td><td>0</td><td>0</td><td>STON/O2. 3101282</td><td>7.9250</td><td>NaN</td><td>S</td></tr>\n",
" <tr><th>4</th><td>1</td><td>1</td><td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td><td>female</td><td>35.0</td><td>1</td><td>0</td><td>113803</td><td>53.1000</td><td>C123</td><td>S</td></tr>\n",
" <tr><th>5</th><td>0</td><td>3</td><td>Allen, Mr. William Henry</td><td>male</td><td>35.0</td><td>0</td><td>0</td><td>373450</td><td>8.0500</td><td>NaN</td><td>S</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " Survived Pclass \\\n", "PassengerId \n", "1 0 3 \n", "2 1 1 \n", "3 1 3 \n", "4 1 1 \n", "5 0 3 \n", "\n", " Name Sex Age \\\n", "PassengerId \n", "1 Braund, Mr. Owen Harris male 22.0 \n", "2 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 \n", "3 Heikkinen, Miss. Laina female 26.0 \n", "4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 \n", "5 Allen, Mr. William Henry male 35.0 \n", "\n", " SibSp Parch Ticket Fare Cabin Embarked \n", "PassengerId \n", "1 1 0 A/5 21171 7.2500 NaN S \n", "2 1 0 PC 17599 71.2833 C85 C \n", "3 0 0 STON/O2. 3101282 7.9250 NaN S \n", "4 1 0 113803 53.1000 C123 S \n", "5 0 0 373450 8.0500 NaN S " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_entrenamiento.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hacemos limpieza de las columnas que no son necesarias para este ejercicio." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "df_entrenamiento = df_entrenamiento.drop(['Ticket', 'Embarked', 'Cabin'], axis=1)\n", "df_entrenamiento = df_entrenamiento.dropna()" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "X = df_entrenamiento.loc[:,'Age':].to_numpy().astype('float')\n", "y = df_entrenamiento['Survived'].ravel() " ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(571, 4) (571,)\n", "(143, 4) (143,)\n" ] } ], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)\n", "print(X_train.shape, y_train.shape)\n", "print(X_test.shape, y_test.shape)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LogisticRegression" ] }, { "cell_type": "code", "execution_count": 19, 
"metadata": {}, "outputs": [], "source": [ "# crear el clasificador\n", "clasificador_reg_log = LogisticRegression(random_state=0, solver='liblinear')" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LogisticRegression(random_state=0, solver='liblinear')" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# entrenar el clasificador\n", "clasificador_reg_log.fit(X_train,y_train)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "accuracy del clasificador - version 1 : 0.64\n" ] } ], "source": [ "print('accuracy del clasificador - version 1 : {0:.2f}'.format(clasificador_reg_log.score(X_test, y_test)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### El hiperparámetro 'penalty'\n", "__[L1 Norms versus L2 Norms](https://www.kaggle.com/residentmario/l1-norms-versus-l2-norms)__\n", "\n", "__[L1 and L2 Regularization Methods](https://towardsdatascience.com/l1-and-l2-regularization-methods-ce25e7fc831c)__" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "#evaluar el desempeño\n", "from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "accuracy del clasificador - version 1 : 0.64\n", "matriz de confusión del clasificador - version 1: \n", " [[69 10]\n", " [41 23]]\n", "precision del clasificador - version 1 : 0.70\n", "recall del clasificador - version 1 : 0.36\n", "f1 del clasificador - version 1 : 0.47\n" ] } ], "source": [ "# accuracy\n", "print('accuracy del clasificador - version 1 : {0:.2f}'.format(accuracy_score(y_test, clasificador_reg_log.predict(X_test))))\n", "# confusion matrix\n", "print('matriz de confusión del clasificador - 
version 1: \\n {0}'.format(confusion_matrix(y_test, clasificador_reg_log.predict(X_test))))\n", "# precision \n", "print('precision del clasificador - version 1 : {0:.2f}'.format(precision_score(y_test, clasificador_reg_log.predict(X_test))))\n", "# recall \n", "print('recall del clasificador - version 1 : {0:.2f}'.format(recall_score(y_test, clasificador_reg_log.predict(X_test))))\n", "# f1\n", "print('f1 del clasificador - version 1 : {0:.2f}'.format(f1_score(y_test, clasificador_reg_log.predict(X_test))))" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[-0.02455679, -0.3946987 , 0.1850001 , 0.02286013]])" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# coeficientes del modelo\n", "clasificador_reg_log.coef_" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['Age', 'SibSp', 'Parch', 'Fare'], dtype='object')" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_entrenamiento.loc[:,'Age':].columns" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('Age', -0.024556791350753466),\n", " ('SibSp', -0.3946987047411627),\n", " ('Parch', 0.1850001010903358),\n", " ('Fare', 0.02286012707033761)]" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(zip(df_entrenamiento.loc[:,'Age':].columns, clasificador_reg_log.coef_[0]))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 2 }